Scalable Similarity Search for Molecular Descriptors
Abstract
Similarity search over chemical compound databases is a fundamental task in the discovery and design of novel drug-like molecules. Such databases often encode molecules as non-negative integer vectors, called molecular descriptors, which represent rich information on various molecular properties. While there exist efficient indexing structures for searching databases of binary vectors, solutions for more general integer vectors are in their infancy. In this paper we present a time- and space-efficient index for the problem that we call the succinct intervals-splitting tree algorithm for molecular descriptors (SITAd). Our approach extends efficient methods for binary-vector databases, and uses ideas from succinct data structures. Our experiments, on a large database of over 40 million compounds, show that SITAd significantly outperforms alternative approaches in practice.
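To make the search problem concrete, the sketch below runs a brute-force similarity search over sparse non-negative integer vectors. It is not SITAd and not taken from the paper: the abstract does not name a similarity measure, so the generalized Jaccard (min-max) similarity used here is an assumption, and the names `Descriptor`, `minmax_similarity`, and `linear_scan` are hypothetical. A linear scan like this is the baseline that an index is meant to avoid.

```python
from typing import Dict, List, Tuple

# Hypothetical sparse representation: feature id -> positive integer count.
Descriptor = Dict[int, int]


def minmax_similarity(a: Descriptor, b: Descriptor) -> float:
    """Generalized Jaccard similarity: sum of minima over sum of maxima.

    Assumed measure for illustration; the abstract does not specify one.
    """
    keys = a.keys() | b.keys()
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    # Two all-zero vectors are treated as identical.
    return numerator / denominator if denominator else 1.0


def linear_scan(
    database: List[Descriptor], query: Descriptor, threshold: float
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs with similarity >= threshold,
    sorted from most to least similar. Cost is linear in the database size."""
    hits = []
    for idx, descriptor in enumerate(database):
        score = minmax_similarity(query, descriptor)
        if score >= threshold:
            hits.append((idx, score))
    return sorted(hits, key=lambda hit: -hit[1])


if __name__ == "__main__":
    database = [{1: 2, 3: 1}, {1: 2, 3: 1, 4: 1}, {7: 5}]
    query = {1: 2, 3: 1}
    print(linear_scan(database, query, threshold=0.7))
```

On a database of tens of millions of compounds, scanning every vector per query becomes expensive, which is the motivation for a dedicated index.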
Similar resources
Text Based Approaches for Content Based Image Retrieval in a P2P Network
The tremendous growth of digital multimedia content on the web requires scalable, efficient, and effective information retrieval mechanisms. Handling such large collections of data in a centralized way requires costly high-bandwidth connectivity and powerful servers. This establishes the need for distributed architectures, such as peer-to-peer systems, that allow sharing of data management and s...
Flexible Similarity Search of Semantic Vectors Using Fulltext Search Engines
Vector representations and vector space modeling (VSM) play a central role in modern machine learning. In our recent research we proposed a novel approach to ‘vector similarity searching’ over dense semantic vector representations. This approach can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity....
Scalable Locality-Sensitive Hashing for Similarity Search in High-Dimensional, Large-Scale Multimedia Datasets
Similarity search is critical for many database applications, including the increasingly popular online services for Content-Based Multimedia Retrieval (CBMR). These services, which include image search engines, must handle an overwhelming volume of data, while keeping low response times. Thus, scalability is imperative for similarity search in Web-scale applications, but most existing methods a...
PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search
We present the Permutation Prefix Index (PP-Index), an index data structure that supports efficient approximate similarity search. The PP-Index belongs to the family of permutation-based indexes, which represent each indexed object by “its view of the surrounding world”, i.e., a list of the elements of a set of reference objects sorted by their distance order with r...
Nonlocal Similarity Image Filtering
We exploit the recurrence of structures at different locations, orientations and scales in an image to perform denoising. While previous methods based on “nonlocal filtering” identify corresponding patches only up to translations, we consider more general similarity transformations. Due to the additional computational burden, we break the problem down into two steps: First, we extract similarit...